What makes reinforcement learning different
- there is no supervisor, only a reward signal
- feedback is delayed, not instantaneous
- time really matters
The RL problem
Rewards
a scalar feedback signal
the goal is to maximize cumulative reward in as little time as possible
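As a rough illustration of how "maximize reward" and "minimize time" can be the same objective, the sketch below assumes a hypothetical task that pays a reward of -1 on every time step until the goal is reached. The task, the episode lists, and the helper `total_reward` are made up for this example.

```python
def total_reward(rewards):
    """Sum the scalar rewards collected along one episode."""
    return sum(rewards)

# Hypothetical task (an assumption, not from the notes): -1 per time step
# until the goal is reached, so shorter episodes collect more reward.
fast_episode = [-1, -1, -1]   # goal reached after 3 steps
slow_episode = [-1] * 10      # goal reached after 10 steps

fast_return = total_reward(fast_episode)
slow_return = total_reward(slow_episode)

# The faster episode has the larger cumulative reward.
print(fast_return, slow_return)
```

Under this reward design, an agent that maximizes cumulative reward is pushed toward the shortest path to the goal.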
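State
- agent state: the agent's internal representation, used to choose the next action
- environment state: the environment's private representation, usually not visible to the agent
- full observability: the agent directly observes the environment state
- partial observability: the agent observes the environment only indirectly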
The three components of an agent
- policy: the agent's behaviour function
  - deterministic policy: a map from state to action, $a=\pi(s)$
  - stochastic policy: $\pi(a|s)=P[A_t=a|S_t=s]$
- value function: how good each state and/or action is
- model: the agent's representation of the environment
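The three components can be written down as plain lookup tables. The sketch below is only an illustration: the two-state problem, the action names, and every number in it are assumptions invented for the example.

```python
import random

# Hypothetical two-state, two-action problem (illustrative only).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Deterministic policy: a = pi(s)
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s]
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng):
    """Draw one action from the distribution pi(.|state)."""
    actions = list(policy[state].keys())
    probabilities = list(policy[state].values())
    return rng.choices(actions, weights=probabilities, k=1)[0]

# Value function: one number per state saying how good it is (made-up values).
state_values = {"s0": 1.5, "s1": 0.0}

# Model: the agent's own guess of (next state, reward) for each (state, action).
model = {
    ("s0", "left"): ("s0", 0.0),
    ("s0", "right"): ("s1", 1.0),
    ("s1", "left"): ("s0", 0.0),
    ("s1", "right"): ("s1", 0.0),
}

rng = random.Random(0)  # seeded so repeated runs draw the same actions
action = sample_action(stochastic_policy, "s0", rng)
predicted_next_state, predicted_reward = model[("s0", action)]
```

An agent does not need all three: a model-free agent keeps a policy and/or a value function but no model.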
Exploration and exploitation
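One common way to trade exploration against exploitation is epsilon-greedy action selection. The notes do not name a specific method, so this is a swapped-in example; the function name `epsilon_greedy`, the value estimates, and the choice of `epsilon` are assumptions.

```python
import random

def epsilon_greedy(action_values, epsilon, rng):
    """Return an action index.

    With probability epsilon pick a uniformly random action (exploration),
    otherwise pick the action with the highest estimated value (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])

# Hypothetical value estimates for three actions.
estimates = [0.1, 0.7, 0.3]
rng = random.Random(0)

greedy_choice = epsilon_greedy(estimates, epsilon=0.0, rng=rng)  # always exploits
mixed_choice = epsilon_greedy(estimates, epsilon=0.1, rng=rng)   # mostly exploits
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between mix the two.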
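Prediction and control
- prediction: given a policy, evaluate the future by computing its value function
- control: find the optimal policy, the one that maximizes future reward, by updating the policy while computing the value function until the policy is optimal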
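The difference between the two problems can be sketched on a toy example. Everything below is an assumption made for illustration: a three-state chain with a terminal state, a reward of -1 per move, a discount factor `GAMMA = 0.9`, iterative policy evaluation as the prediction method, and policy iteration as the control method.

```python
GAMMA = 0.9        # assumed discount factor
TERMINAL = 2       # states are 0, 1, 2; state 2 ends the episode
N_STATES = 3
LEFT, RIGHT = 0, 1

def step(state, action):
    """Hypothetical deterministic dynamics: every move costs -1."""
    next_state = state + 1 if action == RIGHT else max(state - 1, 0)
    return next_state, -1.0

def backup(state, action, values):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]

def evaluate_policy(policy, tol=1e-8):
    """Prediction: compute the value function of a fixed policy."""
    values = [0.0] * N_STATES            # terminal value stays 0
    while True:
        delta = 0.0
        for s in range(TERMINAL):        # non-terminal states only
            new_value = backup(s, policy[s], values)
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values

def improve_policy(values):
    """Act greedily with respect to the current value function."""
    return [max((LEFT, RIGHT), key=lambda a: backup(s, a, values))
            for s in range(TERMINAL)]

def policy_iteration(policy):
    """Control: alternate evaluation and improvement until the policy is stable."""
    while True:
        values = evaluate_policy(policy)
        new_policy = improve_policy(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy

# Prediction: how good is the fixed policy "always move right"?
values_always_right = evaluate_policy([RIGHT, RIGHT])

# Control: start from a poor policy and search for the best one.
best_policy, best_values = policy_iteration([LEFT, LEFT])
```

In this toy problem, prediction only scores the policy it is given, while control starts from "always move left" and ends at "always move right", the policy that reaches the terminal state fastest.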